Exploratory analysis using machine learning of predictive factors for falls in type 2 diabetes

We aimed to investigate the status of falls and to identify important risk factors for falls in persons with type 2 diabetes (T2D) including the non-elderly. Participants were 316 persons with T2D who were assessed for medical history, laboratory data and physical capabilities during hospitalization and given a questionnaire on falls one year after discharge. Two different statistical models, logistic regression and random forest classifier, were used to identify the important predictors of falls. The response rate to the survey was 72%; of the 226 respondents, there were 129 males and 97 females (median age 62 years). The fall rate during the first year after discharge was 19%. Logistic regression revealed that knee extension strength, fasting C-peptide (F-CPR) level and dorsiflexion strength were independent predictors of falls. The random forest classifier placed grip strength, F-CPR, knee extension strength, dorsiflexion strength and proliferative diabetic retinopathy among the 5 most important variables for falls. Lower extremity muscle weakness, elevated F-CPR levels and reduced grip strength were shown to be important risk factors for falls in T2D. Analysis by random forest can identify new risk factors for falls in addition to logistic regression.

www.nature.com/scientificreports/ on the factors for which they are adjusted. To comprehensively evaluate risk factors for falls and to identify those with the highest contribution to falls are important because interventions to effectively prevent falls should be targeted to groups at the greatest risk. However, as the number of predictors increases, especially with a small number of study participants, traditional statistical models such as logistic regression might be unsuitable for identifying important variables as predictors 14,15 . A simulation study on logistic regression found that when the number of events per variable was less than 10, the regression coefficients were biased 15 . This association was also observed for stepwise logistic regression in another simulation study 14 . One of the more appropriate approaches is the nonlinear and nonparametric random forest algorithm, which is one type of machine learning based on the ensemble learning method 16 . In random forest a large number of decision trees are created using bootstrapped data, where a splitting variable at each node is selected from a randomly chosen subset of predictors. Final prediction is determined by a majority vote in the tree ensemble. By examining risk factors for falls using two different methods, logistic regression and random forest algorithm, the risks may be analyzed from different perspectives. Individuals who use walking aids or have reduced walking ability and activities of daily living are at a very high risk of falls as their physical functions such as muscle strength, balance capability and cognition are already reduced. The goal of our study was to identify individuals with type 2 diabetes who are independent in activities of daily living but are at high risk for falls so that they can be provided with appropriate interventions to prevent falls. In this context, we collected data on muscle strength, balance capability, body composition, diabetic microvascular and macrovascular complications, laboratory examinations and medications and aimed to identify important predictors of falls in those with type 2 diabetes who have been independent in daily life using traditional logistic regression and random forest.

Results
The collection rate of the questionnaire was 72% (226/316, 129 men and 97 women, median age 62 years (49,68) (Fig. 1). There were no missing data. Forty-four of the 226 participants had a fall in the year after hospital discharge, for a fall rate of 19%.
The faller group had a significantly higher rate of females, impaired vibratory sensation, proliferative retinopathy and history of stroke and higher levels of fasting C-peptide (F-CPR) compared with the non-faller group (Table 1). However, age, duration of diabetes, DPN and nephropathy were not significantly different between the two groups (Table 1). Although medications both on admission and at discharge were not significantly different between groups, the proportion of sulfonylurea users tended to be higher in the faller group than in the non-faller group (Table 2).
Physical activity during the hospitalization period, body composition and physical capabilities of participants are shown in Table 3. Physical activities did not differ significantly between the faller group and the non-faller  www.nature.com/scientificreports/ group. Skeletal muscle percentage in the faller group was significantly lower than in the non-faller group. On the other hand, skeletal muscle mass index (SMI) and body fat percentage were not significantly different between groups. Variables related to muscle strength including knee extension strength, knee extension endurance, dorsiflexion strength of the ankle joint, toe pinch force and grip strength in the faller group were significantly inferior to those in the non-faller group. For balance capabilities, IPS levels in the faller group were significantly lower than in the non-faller group. Other balance capability-related variables, including mIPS and one-leg standing time, tended to have lower values in the faller group. We performed logistic regression and a random forest classifier with sex, F-CPR level, impaired vibratory sensation, proliferative retinopathy, past history of stroke, taking sulfonylureas at discharge, skeletal muscle percentage, body fat percentage, knee extension strength, knee extension endurance, dorsiflexion strength, toe pinch force, grip strength, IPS and one-leg standing time as covariates in order to identify important risk factors for falls. Logistic regression showed that knee extension strength, F-CPR level and dorsiflexion strength were independent predictors of falls (Table 4). In the random forest, although the selected features changed somewhat in accordance with the training dataset, grip strength, F-CPR levels, knee extension strength, dorsiflexion strength and proliferative retinopathy were the 5 most important variables related to falls selected consistently multiplied by 10 and with tenfold cross validation (Table 5). During 10 random forest runs, explanatory variables were selected 10 times for grip strength and F-CPR, 7 times for knee extension muscle strength, 2 times for ankle dorsiflexion muscle strength and 3 times for proliferative retinopathy. In the ROC analyses, AUC, accuracy, sensitivity and specificity of the logistic regression were 0.75, 0.77, 0.52 and 0.82, respectively, and those of the random forest classifier were according to the average of 10 times and 10 cross validations 0.74 (SD = 0.01), 0.68 (SD = 0.02), 0.77 (SD = 0.04) and 0.66 (SD = 0.03), respectively.

Discussion
We conducted a follow-up questionnaire survey of falls that occurred during the first year after discharge from hospital in persons with type 2 diabetes and analyzed data using machine learning to investigate the relationship between various indices measured during hospitalization and falls after one year. There were three significant findings. First, 19% of the participants experienced falls within one year after discharge. Second, muscle strengthrelated variables were the most important predictors of falls according to both logistic regression and random forest classifier. Third, the F-CPR level was an important predictor of falls, and, to the best of our knowledge, is a novel finding in terms of risk factors for falls. www.nature.com/scientificreports/ www.nature.com/scientificreports/ The annual rate of falls was reported to range from 17 to 32% in community-dwelling older adults 1,8,17,18 and 22-40% in older adults with diabetes 10,11 . In the current study, 19% of participants experienced falls within one year after discharge, even though participants were younger with a median age of 62 years compared with previous reports. This indicates that early measures are needed to prevent falls in persons with type 2 diabetes.
In this study, compared with the non-faller group, the faller group had a significantly higher proportion of females and higher incidence of impaired vibratory sensation, proliferative retinopathy and stroke. F-CPR levels were also significantly higher and values for skeletal muscle percentages, muscle strength-related variables and balance capability-related variables were significantly lower compared with the non-faller group. To investigate the important risk factors for falls, we used logistic regression and random forest 19 . Most of the important variables were shared between these analyses, including knee extension strength, F-CPR level and dorsiflexion strength. These results differ from risks for falls shown in previous studies, which may have been influenced by differences in the explanatory variables, condition of study participants and the methods of analysis. Our study participants were younger and had better physical function than those in previous studies.
A meta-analysis showed that muscle weakness in the upper or lower extremities was associated with future falls 20 . Persons with type 2 diabetes, especially those with diabetic polyneuropathy, were shown to have decreased muscle mass and strength in lower limb extremities compared with non-diabetic control participants [21][22][23] . Although the current study did not have non-diabetic controls, participants had knee extension strength approximately 30% lower than middle-aged obese non-diabetic Japanese in a previous study 24 . Also shown was that muscle strength of knee extensors and the ankle plantar flexor in type 2 diabetic participants were about 30% lower than in healthy control participants 21 . In a cross-sectional study of aging and muscle function in a general population in Belgium, the mean knee extensor muscle strength in the group aged > 70 years was about 30% lower in men and 40% lower in women compared with the group aged 50-60 years 25 . These data suggest that lower body muscle strength in diabetic individuals declined faster than in healthy individuals, which is a major reason for the high prevalence of falls in diabetic individuals.
Grip strength and proliferative diabetic retinopathy were selected as important risk factors only in random forest. In particular, grip strength was the most important variable for falls in random forest, whereas knee extension strength was the most important variable in logistic regression. Since grip strength was reported to correlate with muscle mass or strength in various regions of the body 26,27 , as in the present study, certain predictive models may reflect the risk of falls more strongly than individual measures of lower extremity muscle strength, such as knee extension strength and dorsiflexion strength. Since poor vision is a risk factor for falls 28 , it is reasonable that proliferative retinopathy was selected by random forest in this study.
Unexpectedly, the F-CPR level was an important risk factor for falls with both logistic regression and the random forest classifier showing that F-CPR was among the most important risk factors for falls. Although those results cannot indicate a causal relationship, insulin resistance may be associated with falls. The Third National Health and Nutrition Examination Survey (NHANES III) showed that the elevated homeostasis model assessment of insulin resistance (HOMA-IR) was independently associated with lower SMI 29 . Furthermore, it was reported that abdominal circumference and metabolic syndrome, which are associated with insulin resistance, are associated with the risk of falls 30 . However, F-CPR levels were independent of lower limb muscle strength in this study. Since insulin resistance in skeletal muscle has been reported to inhibit skeletal muscle proliferation 31 , Table 4. Multivariate association of fall-related variables with clinical parameters in stepwise logistic regression analysis. Covariates: sex, fasting C-peptide levels, impaired vibratory sensation, presence of proliferative retinopathy, past history of stroke, taking sulfonylureas at discharge, skeletal muscle percentage, body fat percentage, knee extension strength, knee extension endurance, dorsiflexion strength, toe pinch force, grip strength, index of postural stability, and one-leg standing time.  www.nature.com/scientificreports/ high serum F-CPR may be associated with lower future skeletal muscle mass and skeletal muscle strength.
Another possibility regarding the link between falls and F-CPR is that insulin resistance increases the risk of falls by affecting cognitive function. Insulin resistance was shown to be associated with a cognitive decline in several prospective studies [32][33][34][35] . Decreased executive function was associated with the occurrence of falls 36 , which is impaired in persons with type 2 diabetes beginning in middle age.
In the current study, balance capability, diabetic polyneuropathy and insulin treatment were less important risk factors for falls than muscle strength and F-CPR.
Whether diabetic neuropathy and balance capability are independent risk factors for falls has not been concluded 10 . Because lower limb muscle strength, balance capability and peripheral neuropathy are interrelated, and because measurement methods and participants' backgrounds vary among studies, it is difficult to assess the extent of the impact of these risk factors on falls in diabetic patients. However, the previous reports that showed a significant association between falls and DPN or balance capability did not assess lower extremity muscle strength and included participants with more severe DPN compared with studies that did not show such associations 9,10,37,38 . In addition, the reports that evaluated both balance capability and DPN showed that balance capability was a stronger risk factor for falls than DPN 10,11 . The participants in the current study may have had less severe DPN than those in the study that found that DPN was an independent risk factor for falls. Furthermore, since the median (25th percentile, 75th percentile) age of the current study population was 62 (49, 68) years, which is younger than in previous reports, it is possible that their balancing capabilities were better than those shown previously.
Insulin treatment has been shown to be an independent risk factor for falls, but in this study that was not the case. The Health ABC Study reported that insulin-treated individuals with HbA1c ≤ 6.0% were approximately 4.4 times more likely to fall than those without such treatment 10 . Hypoglycemia may be involved in the high incidence of falls in older adult diabetic patients on insulin therapy. Older adults are less aware of hypoglycemia and have a higher blood glucose threshold for impaired consciousness than younger persons. The current study participant population was comprised of a large number of patients younger than those in previous studies, making them less likely to be aware of hypoglycemia. In addition, the hospitalizations in the previous year may have resulted in fewer hypoglycemic incidents due to appropriate education and optimization of treatment.
In the current study, the AUCs of logistic regression and the random forest classifier were comparable. This result was in line with the results of previous studies, which showed that the accuracy of the predictive models did not change when comparing machine learning with traditional statistical methods [39][40][41] . However, in the current study, the random forest classifier newly selected "grip strength" and "retinopathy" as covariates, which were different from the results of logistic regression. This suggests that use of machine learning in addition to logistic regression can help identify new risk factors.

Limitations of this study.
As for limitations of the study, first, the tracking rate was not as high as desired.
The follow-up rate of a cohort survey is usually required to be at least 80% whereas the current follow-up rate was 70%. This may have resulted in a selectivity bias. Second, since we mailed the questionnaire one year after discharge, recall bias may have been present. A previous study showed that 13% of participants with confirmed falls could not recall having a fall at the end of the study (12 months) 42 . Therefore, we may have underestimated the rate of falls. Third, the follow-up period was limited to one year. Extending the follow-up period might have shown increased rates of falls and uncovered new predictors of falls. Fourth, the number of participants in our study was too small to create a useful predictive model of falls. The logistic regression model considered main effects only to avoid potential overfitting due to the small sample size in relation to the number of possible combinations of predictors when interaction terms were included. We consider that the random forest model can capture such interactions effectively, and hence it is possible to investigate predictive risk factors from different perspectives by combining the two models, which have different mechanisms. Although our analysis may identify major risk factors, as similar variables were selected from both models, more detailed interaction effects should be investigated and validated using larger samples. Such future studies are needed to confirm our findings and to develop a useful predictive model for falls. Fifth, the default hyperparameter settings, which were used in random forest, may cause overfitting. However, we expected that, even if each decision tree overfitted the bootstrapped training dataset, the majority-voting with a sufficiently large ensemble can reduce the negative effect of the overfitting and bring out the predictive power of the decision trees. As n_estimators = 100 in the default setting are relatively large considering that the number of participants in this study was 226, we consider that overfitting may be avoided in the resultant prediction model. Moreover, using tenfold CV can lessen the possibility of overfitting. However, we cannot exclude the possibility that the results obtained in this study are overfitted for our participants. The results need to be validated in other cohorts. Sixth, study participants were hospitalized with poor glycemic control. Therefore, it is unclear whether the results can be extrapolated to patients with type 2 diabetes who were attending an outpatient clinic and were not recently hospitalized. Verification of the results requires a similar group who only attended outpatient clinic. Finally, the study included patients on insulin therapy, and serum F-CPR levels may be underestimated in these patients. Moreover, non-diabetic individuals were not included, and it is unclear whether the results of this study, especially that for F-CPR, can be applied to predict falls in non-diabetic individuals.
In conclusion, we investigated the frequency of falls and their risk factors 1 year after discharge from the hospital in patients with type 2 diabetes. The results showed that falls occurred in 19% of the participants, with knee extension strength, F-CPR and dorsiflexion strength selected as predictors of falls according to logistic regression analysis. Knee extension strength, grip strength, F-CPR and dorsiflexion strength were important predictors in the random forest analysis. We newly found that F-CPR was an important predictor of falls in persons with type 2 diabetes. Muscle weakness and insulin resistance may be strongly associated with falls in type 2 diabetic patients.

Methods
Study design and participants. We conducted a questionnaire survey on falls one year after discharge of 316 persons with type 2 diabetes who had been admitted to the University of Tsukuba Hospital for the treatment of diabetes. We also evaluated their physical capabilities during the hospitalization from February 2014 to May 2018. These study participants were completely independent in walking and activities of daily living. Exclusion criteria included: (1) vitreous hemorrhages or detached retina; (2) class II or higher New York Heart Association Functional Classification; (3) being treated for malignancies; (4) taking glucocorticoids, Cushing's syndrome or acromegaly; (5) post-gastrectomy; (6) unable to walk independently without assistive devices resulting from disorders of the hip or knee joints, paralysis or paresis caused by central nervous system disorders; (7) neuropathy due to causes other than diabetes and (8) difficulty in understanding instructions. This study was approved by the Ethics Committee of the University of Tsukuba Hospital (H27-31) and conducted according to the Declaration of Helsinki. Written informed consent was obtained from all participants.
Clinical data and laboratory tests. We collected socio-demographic information, medical history, anthropometric data and information on medications that participants took on admission and at discharge. Body composition was measured using bioelectrical impedance analysis (InBody 720, Biospace, Tokyo, Japan). SMI was calculated by dividing the limb skeletal muscle mass (kg) by the square of the height (m 2 ) 43 . Diabetic retinopathy was evaluated by ophthalmologists. Ankle-brachial pressure index (ABI) and brachial-ankle pulse wave velocity (baPWV) were measured (BP-203RPE, Colin Medical Technology, Tokyo, Japan). Patients were classified as having peripheral artery disease (PAD) when the value for either of the lower limbs was less than 0.9. For baPWV, after measuring each side the value on the higher side was used. Cardiac autonomic nervous function was assessed by measuring the coefficient of variation of R-R intervals (CVR-R) at rest. Participants with CVR-R < 2.0% were classified as having cardiac autonomic neuropathy 44 .
Blood samples were collected in the morning after an overnight fast within 3 days after admission. Plasma glucose and serum total cholesterol, high-density lipoprotein-cholesterol (HDL-C), triglycerides (TG) and creatinine levels were determined using an automated analyzer (Hitachi High-Technologies, Tokyo, Japan). HbA1c was measured by high-performance liquid chromatography (Tosoh, Tokyo, Japan). Serum LDL-C levels were measured by a homogeneous assay (Sekisui Medical, Tokyo, Japan). Albumin excretion rate (AER) was measured using a turbidimetric immunoassay (Nittobo Medical, Tokyo, Japan). Diabetic nephropathy was defined as AER ≥ 30 mg/d. Estimated glomerular filtration rate (eGFR) was calculated using an equation modified for the Japanese: eGFR = 194 × sCre − 1.0949 × Age − 0.287 × 0.739 (if female) 45 . Serum C-peptide level was measured by an enzyme immunoassay (Tosoh, Tokyo, Japan).

Follow-up of falls.
Fall was defined as "coming in contact with the ground (or floor) from a standing or sitting position with a body part other than the foot in contact with the ground (floor surface) against the patient's intention" 46 . The reliability of a survey of the occurrence of falls over a previous one-year period using the recall method was confirmed 42,46 . A fall history for the previous year was obtained on hospital admission. One year after discharge, we mailed study participants a questionnaire asking about the number of falls (never, once or twice or more) experienced within that 1-year period. Diabetic polyneuropathy. Diabetic polyneuropathy (DPN) was diagnosed based on two or more of the following four criteria from The Diabetic Neuropathy Study Group in Japan 47 and Michigan Neuropathy Screening Instrument 48 : decreased vibration perception using a 128-Hz tuning fork at the bilateral medial malleoli (< 10 s); loss of tactile sensation using a 10 g monofilament at the bilateral foot; decreased or loss of bilateral Achilles jerk reflex; and numbness, pain, paresthesia or hypoesthesia in the bilateral lower limbs or feet.
Physical activity. Physical activity during hospitalization was measured using an accelerometer (Mediwalk: Terumo Co., Tokyo, Japan). Information was collected on the average number of steps, average moderate-tovigorous physical activity time (3 metabolic equivalents or more) and energy expenditure through exercise.

Muscle strength. Knee extension strength and knee extension endurance. Measurements of knee extension
strength and knee extension endurance were performed on the dominant foot side using a torque machine (Biodex System3: Sakai Medical, Tokyo, Japan). For evaluation of knee extension strength, the participant performed three consecutive knee extension operations with maximum effort in isokinetic muscle strength measurement (60°/s), and the maximum torque value (Nm/kg) was used as the representative measurement value. Knee extension endurance was determined by measuring the total work (J) from 20 continuous knee extensions with maximum effort by isokinetic muscle strength measurements (300°/s).
Dorsiflexion strength of the ankle joint. Dorsiflexion strength of the ankle joint was measured using a hand-held dynamometer (Anima Co., Tokyo, Japan). The participant adjusted the height of the chair so that the angle between the knee joint and the ankle joint was 90° while being seated with the heel on the floor and the foot was raised toward the anterior shin. The joint was placed in the maximum dorsiflexion position, and preparations were made for measurement. The examiner applied the attachment of the hand-held dynamometer to the back of the participant's foot and applied maximum pressure in the ankle plantar flexion direction to break the participant's maximum ankle dorsiflexion. The test was performed twice on each side and the highest value for the right side and left side, respectively, was averaged. www.nature.com/scientificreports/ Toe pinch force. Toe pinch force was measured using a pinch force dynamometer (Checker-kun, Nisshin Sangyo Inc., Saitama, Japan). The participant sat on a chair with arms crossed over the chest. The dynamometer was attached to the foot between the great toe and second toe while the participant remained in a seated position with hip and knee joints at 90° of flexion 49 . The test was performed twice on the left and right side, respectively, and the best result for each side was averaged.
Grip strength. Grip strength of the dominant hand was measured using a Smedley analog grip meter (ST100 T-1780, Toei Light Co., Tokyo, Japan). The maximum value (kgf) was taken as the measured value.
Balance capability. Balance capabilities were assessed by the one-leg standing time with eyes open 50 and the index of postural stability (IPS). IPS was measured using a gravicorder (GP-6000, Anima Co., Tokyo, Japan) as described elsewhere 51  Statistical analysis. The sample size for the study was determined by considering the number of cases in previous studies of falls 17 and the number of participants that could be enrolled annually in our study. Study participants were those without missing data. Continuous variables were checked for normal distribution using the Shapiro-Wilk test and were expressed as mean ± SD or median (25th percentile, 75th percentile) based on distribution. Unpaired t-test and Mann-Whitney's U test were used for continuous variables with normal distribution and those with non-normal distribution, respectively. Categorical variables were presented as numbers (percentage) and analyzed using the chi-squared test. Classification analysis was performed based on the identified covariates that were significantly different between faller and non-faller groups. In addition, we used covariates that were not significantly different between our study groups but have been reported as risks for falls in previous studies 5,53,54 . We applied two types of classification models, that is, logistic regression and a random forest classifier, to investigate relationships between fall risk and clinical parameters from different perspectives. These methods were chosen because of their capabilities to quantitatively assess the importance of variables included in the models to predict falls. Models were fit with a stepwise variable selection procedure where the initial variable subset is set as empty and explanatory covariates in the models are sequentially added to or removed from the current subset by evaluating predefined criteria. In logistic regression, we fit the model to the whole dataset and selected the coefficients with p < 0.05 in terms of the likelihood ratio test. We kept the variable with the smallest p value in each step. The performance of the random forest classifier was evaluated by tenfold cross validation. Specifically, the dataset was split into training (9/10) and test (1/10) partitions. The model was fit to the training partition and subsequently tested by the test partition. This procedure was repeated for all folds and the performance was calculated. The entire procedure was repeated 10 times, and the averaged performance was reported (10-times tenfold cross validation). In each cross validation, the variable subsets with the best estimated predictive performance regarding accuracy were identified via the stepwise selection procedure. We also reported the statistics of the selected subsets. The random forest hyperparameters in this study were the default values of the library. To compare the accuracy of the predictive ability of logistic regression and the random forest classifier, receiver operating characteristic (ROC) analysis was performed for each, and accuracy, sensitivity, specificity and the area under the curve (AUC) were determined.
For analysis, Scipy based on Python (programming language) was used, R (GLM) for logistic regression, and imbalanced-learn for the random forest classifier. The criterion for determining statistical significance was 5%. www.nature.com/scientificreports/